Graphics Programming Hardware

Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs 187

An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Excel Charts (Score:3, Insightful)

    by Anonymous Coward on Sunday August 29, 2010 @09:30PM (#33412026)

    I find it very disappointing that a group of programmer/computer science types who even supply BibTeX to make it easier to reference their work, resort to screen-capturing an Excel chart to display their data.

    • by bramp ( 830799 ) on Sunday August 29, 2010 @09:59PM (#33412128) Homepage
      I've also found this annoying when reading papers. Perhaps I've just spent too much time learning how to use gnuplot to make my graphs look nice.
    • Re:Excel Charts (Score:4, Insightful)

      by Anpheus ( 908711 ) on Sunday August 29, 2010 @10:21PM (#33412208)

      Maybe Excel was just the right tool for the job? It's quick and easy to use, and to reformat graphs in.

      I know the Linux tools tend to have a longer loop between tweaking, rendering, and displaying, so a fast WYSIWYG tool works just fine.

    • Re:Excel Charts (Score:5, Insightful)

      by pspahn ( 1175617 ) on Monday August 30, 2010 @01:29AM (#33412798)
      Actually, I find it even more disappointing that a decent way to display datasets on a web page isn't standard yet. Why isn't there a nice embeddable one with column sorting and robust methods for retrieving data? There are solutions, sure, but I have yet to find one that isn't unnecessarily complex, or just plain ugly and difficult to use. But I guess it's just a matter of time, right?
    • Re:Excel Charts (Score:3, Insightful)

      by dominious ( 1077089 ) on Monday August 30, 2010 @05:54AM (#33413464)
      what. the. fuck. +4 Insightful because they use Excel charts?

      Hey I just solved P=NP.
      Yeah, but you are using Excel charts...hmmm sorry kthnx later.
    • Re:Excel Charts (Score:3, Insightful)

      by multipartmixed ( 163409 ) on Monday August 30, 2010 @09:40AM (#33414950) Homepage

      1. Why do you believe those are screen captures, rather than, say, exported images?

      2. How would the data look different if it were displayed with BibTeX?

      3. How fast is using BibTeX? (I've never used it). I could create those same charts in Excel '97 from a CSV of input points easily; probably in under a minute.

  • by black3d ( 1648913 ) on Sunday August 29, 2010 @09:33PM (#33412038)

    Specifically, a GTX480 (just over 1 B keys/sec), followed by a Tesla 2050 at around 75% of the GTX480's speed (745 M keys/sec).

  • by Anonymous Coward on Sunday August 29, 2010 @09:47PM (#33412082)

    GPUs have always been better at sorting your money from your wallet.

  • Not a barrier (Score:5, Insightful)

    by Captain Segfault ( 686912 ) on Sunday August 29, 2010 @09:49PM (#33412084) Homepage Journal

    This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.

    Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!

  • Um... (Score:5, Insightful)

    by Anonymous Coward on Sunday August 29, 2010 @09:50PM (#33412088)

    Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.

    • Re:Um... (Score:5, Interesting)

      by PrecambrianRabbit ( 1834412 ) on Sunday August 29, 2010 @10:19PM (#33412196)

      Given that the particular hardware setup is detailed here (a GTX 480 achieves the 1 billion keys/sec figure), and the algorithm used (radix sort) has known asymptotic behavior (O(nk) for n keys of length k), 10^9 keys/sec is quite meaningful, particularly since it's a significant implementation challenge (possibly even an algorithmic challenge) to port this algorithm to a GPU.

      Furthermore, I think sorting speed is appropriately measured in keys/sec. Big-O does not in fact describe the speed, but rather the upper bound of the growth of an algorithm's asymptotic running time, which needs to be paired with the implementation, architecture, and data set to determine a speed. It turns out the constant factors can actually be quite important in practice.
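      To make the asymptotics concrete, here is a minimal single-threaded sketch of an LSD radix sort (a hedged illustration only, not the researchers' CUDA code): each counting pass handles one digit, so the total work is O(nk) for n keys of k digits.

```python
def radix_sort(keys, bits=32, radix_bits=8):
    """LSD radix sort for non-negative integers: one pass per digit."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        # Scatter keys into buckets by the current digit...
        buckets = [[] for _ in range(1 << radix_bits)]
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        # ...then gather them back in bucket order (stable, so
        # earlier passes' ordering is preserved).
        keys = [key for bucket in buckets for key in bucket]
    return keys
```

      A GPU version replaces the sequential scatter/gather with parallel per-digit counting and prefix sums, which is where the implementation challenge lies.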

      • Also (Score:5, Interesting)

        by Sycraft-fu ( 314770 ) on Monday August 30, 2010 @12:30AM (#33412624)

        CS people get way too caught up in Big O, forgetting that, as you note, it describes a theoretical upper bound on running time, not actual practice, AND is only relevant for one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to go on about how it is great for rendering because "It is O(log n)!" They love to hype it over rasterization like current GPUs do. However, there are two really important things they forget:

        1) What is n? In this case, polygons. It scales with the log of the number of polygons you have. This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However, it turns out polygons aren't the only thing that matters for graphics. Pixels also matter, and ray tracing is O(n) with respect to pixels, and it gets worse for anti-aliasing. For FSAA you cast multiple rays per pixel, which means 4x FSAA has 4x the computation requirements. It turns out rasterization scales much better with regard to resolution and AA. In fact, these days 4x FSAA is often O(1) in actual implementation, in that turning it on doesn't hit the frame rate. That is also why ray tracing demos are low resolution: because THAT'S the hard part.

        2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you add more and more of them you run into RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory; on actual hardware you run into other limits.

        People need to better understand that Big O is a theoretical tool for comparing the scaling of algorithms. That is useful, but you then have to consider the reality of the situation. CS people also need to understand that for users there is only one benchmark that matters: the clock on the wall. Whatever is faster is better. It doesn't matter what the theory says; if the hardware can do X faster than Y, then X is better according to users.

        • by chammy ( 1096007 ) on Monday August 30, 2010 @02:23AM (#33412964)

          This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However, it turns out polygons aren't the only thing that matters for graphics.

          The spheres would most likely be represented as an equation, not a soup of polygons. It's much more efficient (not to mention a whole lot easier) to raytrace. It's also infinitely precise, which is actually why a lot of people are more interested in raytracing than approximating things with polygons.

          For instance, it's a heck of a lot easier to render isosurfaces [demon.co.uk] in a raytracer than turning them into a big polygon soup and rasterizing that.

        • by Thiez ( 1281866 ) on Monday August 30, 2010 @02:59AM (#33413024)

          I don't think it's a fair comparison, really. The only reason modern games look so pretty is because they have the GPU that happens to be designed to help them out. If companies had been designing GPUs to do raytracing for over a decade, it seems rather likely to me that we could do raytracing at modern resolutions.

          Wouldn't it be more fair to compare raytracing to rasterization when you're only allowed to use the CPU?

        • by smallfries ( 601545 ) on Monday August 30, 2010 @04:13AM (#33413222) Homepage

          Pixels also matter, and ray tracing is O(n) with respect to pixels, and it gets worse for anti-aliasing.

          Not sure what your point is there. Any algorithm for rendering that spits out n pixels at the end has an O(n) lower-bound.

        • by jonaskoelker ( 922170 ) <.moc.oohay. .ta. .rekleoksanoj.> on Monday August 30, 2010 @04:31AM (#33413272)

          People need to better understand that Big O is a theoretical tool for comparing the scaling of algorithms. That is useful, but you then have to consider the reality of the situation.

          Heck, why don't we computer scientists do the last step of this whole science thing and come up with a model that describes the actual reality more accurately?

          Oh, we do. Consider the I/O model, the cache-oblivious model or (for a fun combination) the parallel I/O model. (See for instance Aggarwal & Vitter: The Input/Output Complexity of Sorting and Related Problems.)

          if the hardware can do X faster than Y, then X is better according to users.

          Well, users as a collection run your algorithm with a distribution of inputs. You can't say that an algorithm is faster than another (either big-O or wall-clock) without talking about a particular case (worst, best, average) and distribution (diff works "better" on lines of code than paragraphs of latex/text; that's mostly a usability thing, but it illustrates different distributions of real-world data).

        • Re:Also (Score:2, Interesting)

          by Anonymous Coward on Monday August 30, 2010 @08:18AM (#33414156)

          You are missing the same part almost every other CS student out there misses

          K*O(N)+C

          There is usually some K involved that applies to every operation. This can turn a log n op into something huge. Or the setup (the C part) could take longer than just brute-forcing the issue. Many people make this mistake. However, hardware does not stand still either. People expect next year's graphics to be more photorealistic than last year's.

          You use raster vs ray (because you hate how long ray takes) as the reason you hate big O. Instead you should wonder: isn't raster just a special low-cost implementation of ray casting where the reflection depth is set to 1 or 2? Eventually ray will take over. It has to. There is only so much more pre-cooking you can do (which is what raster is). The results do look good and people use them to good effect. However, eventually you have to ray trace it. In fact, today you use many ray-casting methods, such as the z-buffer, per-vertex shading, per-pixel shaders, etc. These methods were invented to speed up ray casting. They work with raster as it is a special case of ray tracing, one where many of the pixels are pre-rendered. Eventually you cannot get the correct colors unless you figure out all the light sources in the scene. Raster also has a scaling problem: you can stretch the pixels out, but they turn into jaggy squares. Anti-aliasing helps some but can only do so much, as the information just does not exist. As screens get larger, with higher pixel depths and higher geometry rates, you eventually just cannot move the raster images around the system fast enough to keep up.

          How do people know these things? They use big O and the other constants to see what it is doing. If you think nVidia, AMD, and Intel are going to keep making raster-only systems from here till the end of time, you're dreaming. Intel can't make a graphics part, we all know this, but to date they have made the most compelling ray tracing one. nVidia and AMD have both stated that eventually it will be ray casting, as they know raster only works for so long. You will see more and more general, highly parallel, special-purpose GPUs coming from these folks. Eventually it will be easy to do ray casting in combination with raster to create very compelling graphics. Notice I didn't say raster goes away. Anything but; it will be around for a long time.

          Ray casting has a distinct long-term advantage over raster. It is done 'as code', meaning you are not shoving huge bitmaps through the system bus, but instead rendering directly into the video memory. The code has an advantage in that it can probably be loaded into, and fit in, the high-speed memory system cache. Bitmaps will need to get ever larger to accommodate better-looking graphics and higher polygon counts.

          Also, you use FSAA as a 'this looks better' when in fact in quite a few cases it looks worse. You forget the FS part of that: full screen. Meaning things that should not be anti-aliased are. Ray casting can take that into account PER object, PER pixel, depending on the z-buffer, and then decide only when it's needed. With ray tracing you don't need FSAA all the time, because you have already done the anti-aliasing at the ray-cast level.

          Also, raster image memory is not cheap either. It is why they built special buses to handle it. You know: AGP, PCIe, etc. It was a practical solution to what the industry is doing right now. However, if you notice, PCIe is not only for graphics but about bandwidth in general, while AGP was a special bus for graphics. They know there are things coming up that need more bandwidth, ray casting being one of many.

          Also, have you not wondered why the graphics chip guys have been in a huge hurry to make themselves into parallel processors? They do not want to be thrown to the side. The CPU would eventually be able to handle the raster images that they do today, with room to spare. They must make the graphics better and do it faster than a CPU could. It just so happens there are stacks of graphics algs t

        • Re:Also (Score:4, Interesting)

          by pz ( 113803 ) on Monday August 30, 2010 @08:33AM (#33414256) Journal

          People need to better understand that Big O is a theoretical tool for comparing the scaling of algorithms. That is useful, but you then have to consider the reality of the situation.

          Right. And any good programmer understands that *first* you pick the right algorithm, and *then* you optimize the implementation. Working the other way around is wasting time.

          But, more importantly, that the parent seems to miss, is that the speed improvement from changing the order of growth of an algorithm can swamp nearly any possible improvements in hardware. Going from O(n^4) to O(n^2) when n is 10,000 is like replacing your desktop box with ten thousand supercomputers. No amount of implementation tweaking or upgrading processors and memory -- which only affects the constants in algorithm speed -- is going to give you the same level of speedup.

          There is a very, very good reason to pay attention to O() analysis of algorithms and, in particular, of their implementations. You can implement bubble sort, which is O(n^2) when done correctly, in a brain-dead way that makes it O(n^3) --- if you, for example, re-extend the output array for each new element --- and the difference is huge. Extra factors of n creep in easily, and understanding them can be highly useful.

          So, the parent can review real-world constraints all he wants, but in the end, order of growth is more important.
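          A small sketch of the point above (hedged: illustrative code, not from TFA). Written carefully, bubble sort stays O(n^2); the brain-dead variant described would copy or re-extend the output array on every insertion, silently adding another factor of n.

```python
def bubble_sort(a):
    """Plain bubble sort, O(n^2) when done correctly.

    Note the single copy up front -- re-extending or rebuilding the
    output array once per element is how the extra factor of n sneaks in.
    """
    a = list(a)  # one copy, not one per element
    n = len(a)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swapped = True
        if not swapped:
            break  # already sorted: early exit
    return a
```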

        • by pitchpipe ( 708843 ) on Monday August 30, 2010 @10:29AM (#33415498)

          You can cry all you like that it wasn't a problem in theory; on actual hardware you run into other limits.

          Reminds me of the quote:

          In theory there is no difference between theory and practice. In practice there is. - Yogi Berra

    • Re:Um... (Score:5, Interesting)

      by SpazmodeusG ( 1334705 ) on Sunday August 29, 2010 @10:36PM (#33412258)

      I wish they'd start putting the "P" into these Big-O notations, where the "P" is the number of processors. Some algorithms don't scale well, some do. Putting the P in illustrates this.

      eg. O( n/P ) illustrates an algorithm that scales perfectly with more cores added. O( n / log(P) ) not so much.

      • Re:Um... (Score:3, Funny)

        by Guido del Confuso ( 80037 ) on Monday August 30, 2010 @01:24AM (#33412778)

        O( n / log(P) ) not so much.

        That algorithm does particularly poorly on just one processor. In fact, if it ran successfully the universe would implode. [mathfail.com]

      • Re:Um... (Score:3, Interesting)

        by frosty_tsm ( 933163 ) on Monday August 30, 2010 @01:36AM (#33412816)
        While an interesting idea, many algorithms behave unpredictably on multiple processors (depending on how much communication would be required). Some will even be slower!

        The effect that extra CPUs will have is too dependent on the hardware implementation to be formalized like this.
      • Re:Um... (Score:3, Informative)

        by Anonymous Coward on Monday August 30, 2010 @01:43AM (#33412832)

        Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big oh notation) using the idea of "speedup."

        The perfect scaling in your first example has linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).

        Here is the relevant Wikipedia article [wikipedia.org].
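        A tiny hedged sketch of the idea (illustrative numbers only): speedup is S(p) = T(1)/T(p), so an O(n/p) algorithm shows linear speedup while an O(n/log p) one shows only logarithmic speedup.

```python
import math

def speedup(t_serial, t_parallel):
    """Classic speedup: time on one processor over time on p processors."""
    return t_serial / t_parallel

n = 1_000_000  # illustrative amount of work; units are arbitrary
for p in (2, 4, 8, 16):
    linear_s = speedup(n, n / p)              # O(n/p): speedup equals p
    log_s = speedup(n, n / math.log2(p))      # O(n/log p): only log2(p)
```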

      • by Kjella ( 173770 ) on Monday August 30, 2010 @05:55AM (#33413472) Homepage

        Big-O notation does not describe the function on a physical processor with caches and pipelines; it's a pure mathematical concept counting mathematical operations. Doubling n may mean things no longer fit in L1/L2/L3/RAM, which has huge performance implications; there is no simple notation for an actual implementation.

        Adding processors would be just one factor; what's the latency and bandwidth of the interconnects? What memory can they access, and at what cost? NUMA plays a big role. There's a huge, huge difference between running something on a cluster (slow interconnects) and a supercomputer (fast interconnects), which is at least as big as the difference between CPU and GPU.

        I suppose you could model a theoretical parallelization with zero latency and infinite bandwidth, counting only the mathematical dependencies, but the results would be pretty much meaningless. Like this sorting algorithm: you could parallelize with hundreds of thousands of simultaneous compares - in theory. In practice the overhead of doing it would completely kill performance.

    • by Concerned Onlooker ( 473481 ) on Sunday August 29, 2010 @11:20PM (#33412386) Homepage Journal

      Obviously you weren't close enough to hear the sonic boom.

    • by Lobais ( 743851 ) on Monday August 30, 2010 @03:56AM (#33413164)

      It doesn't only depend on the hardware setup. It depends on the number of keys!
      Unless the algorithm is actually O(n) and not O(n log n), the rate drops with the increase in the number of keys to sort.

  • by PatPending ( 953482 ) on Sunday August 29, 2010 @09:59PM (#33412126)
  • by joeyadams ( 1724334 ) on Sunday August 29, 2010 @10:12PM (#33412168)
    I think the bubble sort would be the wrong way to go.

    —Barack Obama
  • by Murdoch5 ( 1563847 ) on Sunday August 29, 2010 @10:16PM (#33412180) Homepage
    Hold on, so I can play Jezz Ball, Chips Challenge and all my favourite old school games with a greater level of speed.
  • by jewishbaconzombies ( 1861376 ) on Sunday August 29, 2010 @10:17PM (#33412184)
    Thank god, because my online Porn collection isn't going to sort itself.
  • by crovira ( 10242 ) on Sunday August 29, 2010 @10:21PM (#33412204) Homepage

    you've got lots and lots of RAM with room for the keys and lots of space to waste for unfilled pointers.

    Pass 1, read the records, at the key radix store a record URI
    Pass 2, sweep RAM and read the record URIs in the order you encounter them copying them onto a sequential write device.

    I was doing this years ago.

    If you are careful with the number and sizes of your read buffers, the re-read done for pass 2 doesn't have to be all that disk intensive.

    You can even generate the keys using whatever hash function you find that is truly efficient, and store collisions separately (leave a bit high and go into a linked list maintained by the hash generator for those few keys that hash redundantly.)
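    An in-memory sketch of the two-pass scheme described above (hedged: the names, the direct key-to-slot addressing, and the "lots of space to waste" table are my illustrative stand-ins). Pass 1 scatters record indices into a table slot addressed by key; pass 2 sweeps the mostly empty table in order; colliding (equal) keys chain in a per-slot list.

```python
def two_pass_sort(records, key_of, max_key):
    """Pigeonhole-style sort: scatter by key, then sweep sequentially."""
    table = [None] * (max_key + 1)  # mostly-unfilled slots, as described
    for i, rec in enumerate(records):       # pass 1: scatter record indices
        k = key_of(rec)
        if table[k] is None:
            table[k] = []
        table[k].append(i)                  # equal keys chain in a list
    # pass 2: sweep the table and emit records in the order encountered
    return [records[i] for bucket in table if bucket for i in bucket]
```

    Like the scheme in the comment, this trades a huge sparse table for two linear passes, and the sweep is a purely sequential read.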

    • Well, yeah, they're not claiming they invented radix sort. They're claiming that their GPU implementation of radix sort runs about 4x as fast as the CPU implementation you describe.

      • by kramulous ( 977841 ) on Monday August 30, 2010 @02:18AM (#33412946)

        So, does that mean that if they went out and got the fastest Xeon processor (they used the fastest GPU - excluding the C2070), wrote it parallel, and used the Intel Compiler (tuned to it), the speedup over the CPU would be less than zero?

        After having just looked at their code, also remove the CPU STL shit (actually any template, since they don't vectorise). If you are writing speedy code for the GPU, then to perform an adequate test for the CPU you also have to write appropriately.

        hahahahaha

        This gets better and better...
        They only timed the compute time. cudaMalloc was not part of the timing, nor was cudaMemcpy.

        Sorry, I only count 'time to solution'. That is all I'm interested in. I thought it was strange that a less-than-O(n**2) algorithm was faster on the GPU.

        It is like writing benchmarks that ignore disk I/O, reading from system memory, the L3 cache, etc., and only time stuff that is already in the registers.

        • by julesh ( 229690 ) on Monday August 30, 2010 @06:33AM (#33413572)

          This gets better and better...
          They only timed the compute time. cudaMalloc was not part of the timing, nor was cudaMemcpy.

          Sorry, I only count 'time to solution'. That is all I'm interested in.

          Total time including initialisation and reporting result is only useful if what you're looking at is a complete process -- but when was the last time you just needed to sort something and that sorting wasn't part of a larger algorithm? It's at least not totally out of the question that the data will already be in GPU memory, and that the next step of processing can occur there too, at which point these (O(n)) overheads become irrelevant.

    • by evilWurst ( 96042 ) on Monday August 30, 2010 @02:19AM (#33412948) Journal

      It's generally not size of RAM that breaks radix sort; it's the size of cache. Modern processors are highly reliant on cache, which means they're highly reliant on things in memory being in small tight chunks that fit in cache - because cache misses are expensive enough that if you thrash cache badly enough, you may end up running slower than if you hadn't had any cache at all.

      Good comparison sorts may start fragmented, but by their very nature each pass of the algorithm makes them less so. Radix sort is the other way around; it follows pointers (so more precious scarce cache in use already) that point in more and more fragmented patterns with every pass. That's why even though radix sort's average speed is theoretically faster than quicksort's, quicksort still wins on real-life hardware. And that's probably why radix sort wins on GPUs - the data fits in the card's dedicated memory, which is already optimized to be accessed in a much more parallel way than main memory.

  • by Anonymous Coward on Sunday August 29, 2010 @11:04PM (#33412340)

    The typical sorting problem I've encountered in my career (various types of scientific, telecommunications and business software, though not games) involves an array of pointers to fixed length records that have a sort key (let's say an integer) at a predefined offset. Not an array of integers, nor an array of directly embedded small fixed length records which I'm guessing was used in TFA. The former situation requires random as well as stream access to memory, which would likely favor processing by the CPU in the motherboard of a typical $1000-$2000 PC.

  • Efficient? (Score:2, Interesting)

    by scotty.m ( 1881826 ) on Monday August 30, 2010 @12:19AM (#33412578)
    Not sure how they are defining efficiency in this experiment. I'd like to see how many clock cycles it took to sort each problem, and maybe how much memory the radix sort is using too.
  • by Mr Z ( 6791 ) on Monday August 30, 2010 @03:52AM (#33413152) Homepage Journal
    You know, if they up it to just a bit over 4 billion unique 32-bit keys, say around 4,294,967,296 or so, I think I could sort them rather efficiently, as long as they weren't attached to any payload. ;-)
  • by brunes69 ( 86786 ) <slashdot&keirstead,org> on Monday August 30, 2010 @05:45AM (#33413446)

    I suspect your problem is going to be getting the keys in and out of the GPU memory, not the time the GPU takes to sort them.
